Input: Source code files

101  Create sequences of tokens and structure metrics to form program structure profiles.
     Output: Structure profiles

102  Compare structure profiles to find similar code structures.
     Output: Similar pairs

103  Compare token sequences within matching source code structures using a variant of the Longest Common Subsequence (LCS) algorithm.
     Output: Indices H, HT

Figure 1 
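Step 103 compares token sequences with a variant of the Longest Common Subsequence algorithm. The figure does not say which variant, so the sketch below shows only the classic dynamic-programming LCS length on two token lists. The function name `lcs_length` and the sample tokens are illustrative assumptions, not part of the figure.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists.

    Classic O(len(a) * len(b)) dynamic programme that keeps one row at a time.
    This is the textbook algorithm, not the unspecified variant from step 103.
    """
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


# Hypothetical token sequences taken from two similar code structures.
tokens_a = ["void", "int", "while", "if", "return"]
tokens_b = ["void", "while", "int", "if", "return"]
print(lcs_length(tokens_a, tokens_b))  # 4
```

A longer common subsequence relative to the sequence lengths suggests a closer structural match; how the score is normalised is left open here because the figure gives no formula.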
Input: Source code files

Steps, in order:
  1. Remove comments and string constants.
  2. Translate upper-case letters to lower-case.
  3. Map synonyms to a common form.
  4. Reorder the functions into their calling order.
  5. Remove all tokens that are not specific programming language keywords.

Intermediate result: Token file pairs
Output: Matching pairs

Reference numerals appearing in the figure: 201, 202, 203, 204, 205, 206, 207.

Figure 2 
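The preprocessing steps listed for Figure 2 can be sketched in a few lines of Python. This is a rough illustration under stated assumptions: the keyword set and the synonym map are hypothetical and deliberately tiny, the function-reordering step is omitted, and character literals such as `'"'` are not handled.

```python
import re

# Hypothetical, incomplete keyword list and synonym map (assumptions).
KEYWORDS = {"char", "else", "for", "if", "int", "return", "void", "while"}
SYNONYMS = {"long": "int", "short": "int"}

# Matches /* ... */ comments, // comments, and "..." string constants.
STRIP = re.compile(r'/\*.*?\*/|//[^\n]*|"(?:\\.|[^"\\])*"', re.S)


def to_tokens(src):
    """Apply steps 1, 2, 3 and 5 of the figure; step 4 (reordering) is skipped."""
    text = STRIP.sub(" ", src)                        # 1. drop comments and strings
    text = text.lower()                               # 2. lower-case everything
    words = re.findall(r"[a-z_][a-z0-9_]*", text)
    words = [SYNONYMS.get(w, w) for w in words]       # 3. map synonyms
    return [w for w in words if w in KEYWORDS]        # 5. keep keywords only


sample = 'void F(char *s) { /* if */ LONG n; while (1) n = strlen(s); // for\n puts("else"); }'
print(to_tokens(sample))  # ['void', 'char', 'int', 'while']
```

The words `if`, `for` and `else` in the sample appear only inside comments or a string constant, so they do not reach the token list.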
Input: Source code files

Phase 1:
  - Remove whitespace, comments, and identifier names.
  - Replace remaining language statements by tokens.

Intermediate result: Token sequences

Next step: Compare token sequences using the Greedy String Tiling algorithm.

Output: Matching pairs

Reference numerals legible in the figure: 302, 304 (two others are illegible).

Figure 3 



Input: Source code files

401  Remove whitespace and punctuation from each file and convert all characters to lower case.

402  Divide the remaining non-whitespace characters of each file into k-grams.
     Output: k-grams

403  Hash each k-gram and select a subset of all k-grams to be the fingerprints of the document.
     Output: Fingerprints

404  Compare document fingerprints.
     Output: Matching pairs

Figure 4 
She loves you yeah, yeah, yeah,
(a) Some text. (501)

shelo helov elove loves ovesy vesyo esyou syouy youye
ouyea uyeah yeahy eahye ahyea hyeah yeahy eahye ahyea
hyeah
(b) The sequence of 5-grams derived from the text. (502)

77 72 42 17 98 50 23 55 6 66 34 24 39 11 84 24 39 11 84
(c) A hypothetical sequence of hashes of the 5-grams.

72 24 84 24 84
(d) The fingerprints: only those hashes that are 0 mod 4 are selected.

Figure 5 
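The worked example in Figure 5 can be reproduced with a short Python sketch. The function names `kgrams` and `select_fingerprints` are illustrative assumptions. The hash values are the hypothetical ones from part (c), not the output of a real hash function.

```python
def kgrams(text, k):
    """Drop whitespace and punctuation, lower-case, then slide a window of width k."""
    cleaned = "".join(ch.lower() for ch in text if ch.isalnum())
    return [cleaned[i:i + k] for i in range(len(cleaned) - k + 1)]


def select_fingerprints(hashes, p):
    """Keep only the hashes that are 0 mod p."""
    return [h for h in hashes if h % p == 0]


grams = kgrams("She loves you yeah, yeah, yeah,", 5)
print(len(grams), grams[:4])  # 19 ['shelo', 'helov', 'elove', 'loves']

# Hypothetical hashes copied from part (c) of the figure, one per 5-gram.
hashes = [77, 72, 42, 17, 98, 50, 23, 55, 6, 66, 34, 24, 39, 11, 84, 24, 39, 11, 84]
print(select_fingerprints(hashes, 4))  # [72, 24, 84, 24, 84]
```

Identical k-grams always receive identical hashes, which is why the repeated "yeah" produces the repeated run 24 39 11 84 in part (c).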



    /* begin routine */
    void fdiv(
        char *fname,    // file name
        char *path)     // path
    {
        int Index1, j;

        while (1)
            j = strlen(fname);
        /* find the file extension */

(a) C source code snippet for file 1.

Index   SourceLines1                                  CommentLines1
[0]     ""                                            "begin routine"
[1]     "void fdiv"                                   ""
[2]     "char fname"                                  "file name"
[3]     "char path"                                   "path"
[4]     ""                                            ""
[5]     "int Index1 j"                                ""
[6]     ""                                            ""
[7]     "while 1"                                     ""
[8]     "... strlen ... fname" (partly illegible)     ""
[9]     ""                                            "find the file extension"

(b) Source code and comment line arrays for file 1. (602)

Word1[0] = "fdiv"
Word1[1] = "fname"
Word1[2] = "path"
Word1[3] = "Index1"

(c) Array of unique identifiers (non-keywords) in file 1. (603)

Figure 6 
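The arrays in Figure 6 can be approximated with the Python sketch below. Several details are assumptions rather than facts from the figure: punctuation (including `=`) is treated as whitespace, block comments are assumed not to span lines, comment markers inside string constants are not handled, and the word array excludes one-character names and a small hypothetical keyword list that includes `strlen`.

```python
import re

KEYWORDS = {"void", "char", "int", "while", "strlen"}  # hypothetical, tiny list

FILE1 = """\
/* begin routine */
void fdiv(
    char *fname,    // file name
    char *path)     // path
{
    int Index1, j;

    while (1)
        j = strlen(fname);
    /* find the file extension */
"""

COMMENT = re.compile(r"/\*(.*?)\*/|//(.*)")


def split_line(line):
    """Return (source text, comment text) for one line, whitespace collapsed."""
    comment = " ".join(" ".join((a or b).split()) for a, b in COMMENT.findall(line))
    code = COMMENT.sub(" ", line)
    return " ".join(re.findall(r"\w+", code)), comment


def unique_words(source_lines):
    """Unique identifiers in order of first appearance (assumed exclusion rules)."""
    seen = []
    for line in source_lines:
        for w in line.split():
            is_name = w[0].isalpha() or w[0] == "_"
            if is_name and len(w) > 1 and w not in KEYWORDS and w not in seen:
                seen.append(w)
    return seen


pairs = [split_line(line) for line in FILE1.splitlines()]
source_lines = [code for code, _ in pairs]
comment_lines = [comment for _, comment in pairs]

print(source_lines[:4])            # ['', 'void fdiv', 'char fname', 'char path']
print(comment_lines[:4])           # ['begin routine', '', 'file name', 'path']
print(unique_words(source_lines))  # ['fdiv', 'fname', 'path', 'Index1']
```

Blank lines and comment-only lines are kept as empty strings so that array indices stay aligned with line numbers, as in part (b) of the figure.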
Index   Word1       Word2
[0]     "abc"       "Aabc"
[1]     "abc1"      (illegible)
[2]     "abc123"    "abc1111111"
[3]     "abcdef"    "abcXXXyz"
[4]     "pdq"       (illegible)
[5]     "XXX"       (illegible)
[6]     "xyz"       "pdq"
[7]     "yyy"       (illegible)

(a) Non-keyword words in files 1 and 2.

PartialWord[0] = "abc"
PartialWord[1] = "abc1"
PartialWord[2] = "xxx"
PartialWord[3] = "xyz"

(b) Matching partial words

Figure 7 



Line   File 1                          File 2
1      /* begin routine */             /* find the file extension */
2      void fdiv(                      void file_divide(
3      char *fname, // file name       char * fname,
4      char *path) /* path */          char *path)
5      {                               {
6      int Index1, j;                  int i, j;
7                                      while (1) // loop here
8      while (1)                       j = strlen ( fname ) ;
9      j = strlen (fname) ;
10     // find the file extension

(a) Two files. (801)

3/3
4/4
9/8

(b) Matching source lines in file1/file2. (802)

Figure 8 
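The pairs in part (b) of Figure 8 can be reproduced with the Python sketch below. It rests on assumptions inferred from the figure rather than stated in it: comments are stripped, punctuation is treated as whitespace, comparison ignores case, and lines made up only of keywords and numbers (such as `while (1)` or `{`) are never reported as matches. The keyword list is hypothetical and minimal.

```python
import re

KEYWORDS = {"void", "char", "int", "while"}  # hypothetical, tiny list

FILE1 = [
    "/* begin routine */",
    "void fdiv(",
    "char *fname, // file name",
    "char *path) /* path */",
    "{",
    "int Index1, j;",
    "",
    "while (1)",
    "j = strlen (fname) ;",
    "// find the file extension",
]
FILE2 = [
    "/* find the file extension */",
    "void file_divide(",
    "char * fname,",
    "char *path)",
    "{",
    "int i, j;",
    "while (1) // loop here",
    "j = strlen ( fname ) ;",
    "",
    "",
]


def normalize(line):
    """Strip comments, treat punctuation as whitespace, lower-case."""
    code = re.sub(r"/\*.*?\*/|//.*", " ", line)
    return " ".join(re.findall(r"\w+", code)).lower()


def has_identifier(norm):
    """True if the line holds at least one token that is not a keyword or number."""
    return any(not t.isdigit() and t not in KEYWORDS for t in norm.split())


def match_source_lines(file1, file2):
    """Return 1-based (file1 line, file2 line) pairs whose normalized text is equal."""
    matches = []
    for i, a in enumerate(file1, 1):
        na = normalize(a)
        if not has_identifier(na):
            continue
        for j, b in enumerate(file2, 1):
            if na == normalize(b):
                matches.append((i, j))
    return matches


print(match_source_lines(FILE1, FILE2))  # [(3, 3), (4, 4), (9, 8)]
```

Without the keyword-only filter, `while (1)` would also be reported as 8/7, which the figure does not list.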
Line   File 1                          File 2
1      /* begin routine */             /* find the file extension */
2      void fdiv(                      void file_divide(
3      char *fname, // file name       char * fname,
4      char *path) /* path */          char *path) // path
5      {                               {
6      int Index1, j;                  int i, j; /* begin routine */
7                                      while (1) // loop here
8      while (1)                       j = strlen (fname) ;
9      j = strlen (fname) ;
10     // find the file extension      switch (x)
11     if (X = 5) {                    {

(a) Two files. (901)

1/6
4/4
10/1

(b) Matching comment lines in file1/file2. (902)

Figure 9 
Input: Source code files

First step: Create source line array, comment line array, word array.

1002  Source Line Matching
      Input: Source line arrays. Output: Matching source lines
1003  Comment Line Matching
      Input: Comment line arrays. Output: Matching comment lines
1004  Word Matching
      Input: Word arrays. Output: Matching words
1005  Partial Word Matching
      Input: Word arrays. Output: Matching partial words
1006  Semantic Sequence Matching
      Input: Source line arrays. Output: Matching semantic sequences
1007  Combine all scores.
      Input: Match scores. Output: Total match score

Figure 10 
Comparing files in folder D:\CodeMatch\Code Development\test\C test 2\files 1
To files in folder D:\CodeMatch\Code Development\test\C test 2\files 2

D:\CodeMatch\Code Development\test\C test 2\files 1\bpf_dump.c   (1102)

Match Score   Compared To File
2910          D:\CodeMatch\Code Development\test\C test 2\files 2\bpf_dump.c
374           D:\CodeMatch\Code Development\test\C test 2\files 2\W32NReg.c
374           D:\CodeMatch\Code Development\test\C test 2\files 2\test\W32NReg (variable names changed).c
224           D:\CodeMatch\Code Development\test\C test 2\files 2\test\W32NReg (no comments).c

D:\CodeMatch\Code Development\test\C test 2\files 1\bpf_filter.c   (1103)

Match Score   Compared To File
606           D:\CodeMatch\Code Development\test\C test 2\files 2\W32NReg.c
606           D:\CodeMatch\Code Development\test\C test 2\files 2\test\W32NReg (no comments).c
572           D:\CodeMatch\Code Development\test\C test 2\files 2\test\W32NReg (variable names changed).c
398           D:\CodeMatch\Code Development\test\C test 2\files 2\bpf_dump.c

Figure 11 
Comparing file1: D:\CodeMatch\[...]\test\C test 2\files 1\bpf_dump.c
To file2: D:\CodeMatch\[...]\C test 2\files 2\test\W32Nreg.c

Matching source lines:

File1 Line#   File2 Line#   Source line
21            1             #include <windows.h>
22            3             #include <stdio.h>
24            7             #include "WiNDIS.h"

Matching comment lines:

File1 Line#   File2 Line#   Comment line
3             3             * The Regents of the University of California. All rights reserved.
10            5             * Redistribution and use in source and binary forms, with or without

Longest matching semantic sequence:

File1 Line#   File2 Line#   Number of matching lines
21            1             3

Matching words:

stdio
WiNDIS
windows

Matching partial words:

Ox
windows

Reference numerals appearing in the figure: 1201, 1202, 1203, 1204, 1205, 1206.

Figure 12 